January 5, 2022

What is Data Science?

  • Data science is an emerging field that combines important concepts from statistics, computer science, and substantive areas of focus.

Drew Conway’s Venn diagram of data science

  • Data science is the science of extracting meaning and information from data to change some outcome

Data Science in Public Policy…?

  • Making effective policies requires data and data evaluation

    • This is evidence-based policymaking
  • Regulation will likely require knowledge of data science tools and methods

Data Science in Public Policy…?

Source: Alex Englet/The University of Chicago

Data Science Questions?

  • Easier questions:
  1. What is spending on healthcare like in the United States?
  2. How have voting patterns for each party changed in recent years?
  • Harder questions:
  1. How do seat belt laws affect traffic fatalities? How can we reduce fatalities?
  2. Do job training programs reduce unemployment?

Data Science is a Process

Source: Doing Data Science Chapter 2

The Data Scientist’s Activity in the DS Process

Source: Doing Data Science Chapter 2

The Data Scientist’s Toolbox

Data Processing and Cleaning

  • R, Python, SQL

Exploratory Data Analysis

  • R, Python

Data Analysis

  • Statistical Models, Machine Learning Algorithms

The R Project

  • R is a programming language created by statisticians
  • Specifically designed for data analysis/statistics and data visualization
  • It’s free! Download it here: https://www.r-project.org

Python

  • Python is a programming language created by computer scientists

  • It is a general purpose language that can also be used for data analysis

  • It’s free! Download it here: https://www.python.org

SQL

  • SQL is a language for querying and managing databases

  • How would you deal with 20 terabytes of data that gets updated every hour?

    • Yeah, that’s why you need SQL
  • It has different dialects, or flavors (MySQL, Oracle, Postgres)

  • Many dialects are free! Download one (MySQL) here: https://mysql-com.en.softonic.com/download
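
To give a feel for what SQL looks like, here is a minimal sketch that runs one query from Python. It makes several assumptions: it uses SQLite (another free dialect, bundled with Python) rather than MySQL, the table name `investment` is made up for illustration, the rows are borrowed from the firm-by-year table in the clean data examples in these slides, and `inv` is presumed to be an investment amount.

```python
import sqlite3

# (firm, year, inv) rows borrowed from the firm-by-year table in these slides.
ROWS = [
    (1, 1953, 1304.4), (1, 1954, 1486.7),
    (2, 1953, 641.0), (2, 1954, 459.3),
    (3, 1953, 179.5), (3, 1954, 189.6),
]


def average_inv_by_firm(rows):
    """Load rows into an in-memory SQLite table and average inv per firm."""
    conn = sqlite3.connect(":memory:")
    try:
        # "investment" is a hypothetical table name chosen for this sketch.
        conn.execute(
            "CREATE TABLE investment (firm INTEGER, year INTEGER, inv REAL)"
        )
        conn.executemany("INSERT INTO investment VALUES (?, ?, ?)", rows)
        query = """
            SELECT firm, AVG(inv) AS mean_inv
            FROM investment
            GROUP BY firm
            ORDER BY firm
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


result = average_inv_by_firm(ROWS)
for firm, mean_inv in result:
    print(firm, round(mean_inv, 2))
```

The same `SELECT ... GROUP BY` statement would look nearly identical in other dialects; the point is that the database does the aggregation, so the full table never has to fit in your own program's memory.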

Integrated Development Environments (IDEs)

Let’s say you need to write a 20 page report. How will you write it?

IDEs offer better tools to help you write code!

Examples of Clean Data

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  1. What does each row describe?
  2. Can you tell what the values in each column represent?

Examples of Clean Data

   firm year    inv  value capital
19    1 1953 1304.4 6241.7  1777.3
20    1 1954 1486.7 5593.6  2226.3
39    2 1953  641.0 2031.3   623.6
40    2 1954  459.3 2115.5   669.7
59    3 1953  179.5 2371.6   800.3
60    3 1954  189.6 2759.9   888.9
  1. What does each row describe?
  2. Can you tell what the values in each column represent?

Examples of Messy Data

##             x         y z
## [1,] 3.378027 0.3638566 0
## [2,] 1.005682 0.7044037 1
## [3,] 5.765306 0.5332968 0
## [4,] 3.910435 0.7408176 1
## [5,] 4.189368 0.6041370 0
## [6,] 1.798274 0.5282460 1
  1. What does each row describe?
  2. Can you tell what the values in each column represent?

Examples of Messy Data

   CPO1976   DCO9679    z t
1 3.378027 0.3638566 -999 3
2 1.005682 0.7044037    1 1
3 5.765306 0.5332968 -999 6
4 3.910435 0.7408176    1 4
5 4.189368 0.6041370 -999 4
6 1.798274 0.5282460    1 2
  1. What does each row describe?
  2. Can you tell what the values in each column represent?

Messy Data is Bad

  • Data should be meaningful (or it’s not useful)

  • We need to be able to understand what the data tells us when we look at it

    • What do the rows describe?
    • What does each value mean?
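
A first cleaning pass on the second messy example might look like the sketch below. Everything about the meaning of the data is an assumption here: the slides do not say what `CPO1976`, `DCO9679`, `z`, or `t` measure, so the new column names are placeholders, and treating `-999` as a missing-value code is a guess that would need to be checked against a codebook.

```python
# Values copied from the second messy example in these slides.
RAW = [
    {"CPO1976": 3.378027, "DCO9679": 0.3638566, "z": -999, "t": 3},
    {"CPO1976": 1.005682, "DCO9679": 0.7044037, "z": 1, "t": 1},
    {"CPO1976": 5.765306, "DCO9679": 0.5332968, "z": -999, "t": 6},
    {"CPO1976": 3.910435, "DCO9679": 0.7408176, "z": 1, "t": 4},
    {"CPO1976": 4.189368, "DCO9679": 0.6041370, "z": -999, "t": 4},
    {"CPO1976": 1.798274, "DCO9679": 0.5282460, "z": 1, "t": 2},
]

# Hypothetical names: a real project would take these from a codebook.
RENAME = {
    "CPO1976": "measure_a",
    "DCO9679": "measure_b",
    "z": "flag",
    "t": "count",
}

# Assumption: -999 is a placeholder for "no value recorded".
MISSING_CODE = -999


def clean(rows):
    """Rename columns and turn the assumed missing code into None."""
    cleaned = []
    for row in rows:
        new_row = {}
        for old_name, value in row.items():
            new_row[RENAME[old_name]] = None if value == MISSING_CODE else value
        cleaned.append(new_row)
    return cleaned


tidy = clean(RAW)
n_missing = sum(1 for row in tidy if row["flag"] is None)
print(n_missing, "of", len(tidy), "rows have a missing flag")
```

Recoding the placeholder matters because a value like -999 would otherwise be averaged in as if it were a real measurement.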

Exploratory Analysis

  • Once we understand our data we can explore it

  • Let’s use an example

    • Suppose we want to know what predicts a car’s city miles per gallon

    • To explore this question, we’ll use data on vehicle mpg:

# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Exploratory Analysis

  • How many different manufacturers do we have in our data?

    • 15
  • What are the oldest and newest model years in our data?

    • 1999, 2008
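
Questions like these come down to counting distinct values and taking a minimum and maximum. The sketch below does that in plain Python on only the six preview rows printed in these slides, so it finds a single manufacturer; the answer of 15 comes from the full data set, which is not reproduced here.

```python
# The six preview rows from the slides:
# (manufacturer, model, displ, year, cty, hwy)
PREVIEW = [
    ("audi", "a4", 1.8, 1999, 18, 29),
    ("audi", "a4", 1.8, 1999, 21, 29),
    ("audi", "a4", 2.0, 2008, 20, 31),
    ("audi", "a4", 2.0, 2008, 21, 30),
    ("audi", "a4", 2.8, 1999, 16, 26),
    ("audi", "a4", 2.8, 1999, 18, 26),
]

# A set keeps one copy of each value, so its length is the distinct count.
manufacturers = {row[0] for row in PREVIEW}
n_manufacturers = len(manufacturers)

# The earliest and latest model years give the range of the data.
years = [row[3] for row in PREVIEW]
first_year, last_year = min(years), max(years)

print(n_manufacturers, first_year, last_year)
```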

Exploratory Analysis

  • What are the average city and highway MPG ratings?
      cty             hwy       
 Min.   : 9.00   Min.   :12.00  
 1st Qu.:14.00   1st Qu.:18.00  
 Median :17.00   Median :24.00  
 Mean   :16.86   Mean   :23.44  
 3rd Qu.:19.00   3rd Qu.:27.00  
 Max.   :35.00   Max.   :44.00  
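
A summary table like this one can be approximated with Python's `statistics` module. The sketch below is computed on only the six preview rows shown earlier in these slides, so its numbers will not match the full-data summary above, and it skips the quartiles to keep things short.

```python
from statistics import mean, median

# City and highway MPG from the six preview rows in these slides.
CTY = [18, 21, 20, 21, 16, 18]
HWY = [29, 29, 31, 30, 26, 26]


def summarize(values):
    """Return a few of the statistics that a summary table reports."""
    return {
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }


cty_summary = summarize(CTY)
hwy_summary = summarize(HWY)
print("cty", cty_summary)
print("hwy", hwy_summary)
```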

Modeling

  • In the modeling stage we estimate formal relationships between variables

  • We can also make predictions about future data
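
As one illustration of both ideas, the sketch below fits a straight line relating city MPG to engine displacement by ordinary least squares and then predicts MPG for a new car. The slides do not name a specific model, so simple linear regression is an assumption here, and six preview rows are far too few for a serious analysis; the 2.5-liter engine is a made-up input.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Engine displacement (liters) and city MPG from the six preview rows.
DISPL = [1.8, 1.8, 2.0, 2.0, 2.8, 2.8]
CTY = [18, 21, 20, 21, 16, 18]

intercept, slope = fit_line(DISPL, CTY)

# Prediction for a hypothetical car with a 2.5-liter engine.
predicted = intercept + slope * 2.5
print(round(intercept, 2), round(slope, 2), round(predicted, 2))
```

The negative slope says that, in these six rows, larger engines go with lower city MPG; the estimated relationship is what lets us make a prediction for a car we have not seen.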

Workshop Overview

In this workshop series, we’ll work through each stage of this process

  1. Data cleaning

  2. Exploratory analysis

  3. Modeling

  • We’ll walk through each of these steps first using Stata, then using R

Workshop Learning Objectives

If you make an honest effort to solve each problem in this class, I promise that you will be able to do the following by the end of the workshop:

  1. Load a data set in Stata and R and clean it for analysis

  2. Calculate descriptive statistics and visualize your data in Stata and R

  3. Analyze data with basic modeling techniques in Stata and R

  4. Visualize the results of your model and generate predictions

Stata

  • Stata is a commercial statistical software program

  • Stata uses a simplified syntax to make it easy to use

  • Also has easy “point-and-click” options (we will avoid these)

R

  • R is an open-source programming language specially designed for data science

  • R is completely free: https://www.r-project.org

  • R relies more heavily on code, which makes it more complicated to learn than Stata

  • R is much more flexible than Stata

  • To show R I’ll be using RStudio: https://rstudio.com